\chapter{Learning to learn}
\begin{quotation}
\noindent ``\emph{All of the biggest technological inventions created by man -
	the airplane, the automobile, the computer - says little about his 
	intelligence, but speaks volumes about his laziness.  }''
\begin{flushright}\textbf{Mark Kennedy}\end{flushright}
\end{quotation}

\vspace*{0.5cm}

\section{Goals and foundations}
Developing and tuning algorithms to find the optimal strategy to solve a
reinforcement learning problem is hard. Some of the challenges one meets are:
\begin{itemize}
	\item the tradeoff between exploration and exploitation
	\item designing a strategy that allows for versatile training
	\item choosing hyperparameters (learning rate, architecture of a
		neural network,...) \index{hyperparameter} that make the strategy
		optimal considering the reinforcement learning problem at hand.
\end{itemize}

Let us consider the problem of learning a simple bandit problem with dependent
arms. Say, for example, that a bandit has two arms, each of which producing
a reward according to a Bernoulli distribution with the following parameters :
$$ \begin{cases} P(r \mid \text{arm}_1) = p_b \\ 
P(r \mid \text{arm}_2) = 1 - p_b  \end{cases} $$
where $P(r \mid \text{arm}_1)$ is the probability of arm 1 to generate a reward
and $0 \leq p_b \leq 1$ is the parameter of the bandit problem. One way to
solve this problem would be to use the $\epsilon$-greedy strategy (see
Section~\ref{sec:rl_example}). \\

This raises the question about choosing $\epsilon$. Choosing a low $\epsilon$
encourages exploitation but we have a higher chance of wrongly estimating $p_b$.
Choosing a high $\epsilon$ doesn't allow the agent to perform optimally once
its knowledge about the parameters of the problem is likely to be optimal.
Moreover, we could choose a variable $\epsilon$, high at first but slowly
converging to 0, but we then are faced with the choice of another
hyperparameter: the number of steps over which $\epsilon$ should be annealed.\\

One could propose using a more advanced method, and there are many. Multi-armed
bandits have been heavily studied and to this day, several algorithms exist to
solve the kind of bandit problems defined above very quickly - i.e. to explore
just enough to implicitly learn $p_b$ so to exploit the best arm as much as
possible. Examples of these algorithms are the Gittins indices algorithm
\cite{Gittins79banditprocesses},
UCB \cite{Auer:2002:FAM:599614.599677} and Thompson sampling
\cite{thompson1933}.\\

The obvious problem here is that we can always choose to manually design more
and more sophisticated techniques and to tune more perfectly their parameters,
but could we not instead use reinforcement learning to do this for us instead?\\

Recently, Wang et al. \cite{learningtorl} and Duan et al. \cite{fastrlviaslowrl}
proposed the idea of using reinforcement learning to learn an algorithm which
deploys an optimal strategy for a class of problems sharing a similar structure.
We will use the name "meta-RL" or meta-learning for this technique, as proposed 
by Wang et al. \\
\begin{figure}
	\centering
	\includegraphics[width=0.7\linewidth]{fig/normal_bandit_training.eps}
	\caption{A classic training sequence to learn a bandit problem. One
	manually designs a strategy which inspects the reward obtained at 
	the end of an episode and evaluates accordingly the performance of the action
	taken. The strategy then updates the policy inbetween episodes.}
	\label{fig:normal_bandit_training}
\end{figure}

In classic reinforcement learning (see Figure~\ref{fig:normal_bandit_training}),
\textbf{the network represents a policy.} Its goal is to maximise its reward
for each episode, and it is trained by looping the following steps:
\begin{enumerate}
	\item let the network play an episode
	\item update the network using a hand-designed strategy (Gittins, UCB,
		Q-learning \cite{qlearning}, SARSA \cite{sarsa}, ...) to 
		increase the total reward of an episode
\end{enumerate}

As stated at the beginning of this chapter, designing strategies to learn
policies is a hard task and can demand a sensitive tuning of parameters.
In meta-RL, \textbf{the network represents an algorithm which learns a policy}
Its goal is to be the best possible learning strategy, which is why we train
it differently:
\begin{enumerate}
	\item let the network play a \textbf{trial} \index{trial} of several
		episodes of the same problem
	\item update the network so that it learns faster
\end{enumerate}
In this context, the network has to maximise its expected reward across
\textbf{all} episodes of a trial (see 
Figure~\ref{fig:meta_bandit_training}). Note that a trial replicates the setting
of learning to solve a problem -- the agent is not learning to solve a 
problem anymore, it is learning to learn how to solve a problem, but in 
a very low number of episodes rather than the many episodes needed 
to train a standard reinforcement learning agent. This will incentivize the agent
to understand the structure of a problem and develop a strategy that allows it 
to estimate the parameters of the problem as fast as possible.
Generally, we will call the learned policy the inner 
algorithm, and the algorithm that learns the policy the outer algorithm.\\


\begin{figure}
	\centering
	\includegraphics[width=\linewidth]{fig/meta_bandit_training.eps}
	\caption{The training sequence for a meta-RL agent on a bandit problem.
	In this case, the agent is left to play several episodes on its own
	without changing its weights. It is only when a trial ends that the
	outer algorithm evaluates the rewards of all episodes in the trial and
	updates the weights of the inner algorithm so that it increases its
	chances of having a better reward across all episodes. For trials of 
	three episodes such as the one presented in this figure, the outer
	algorithm forces the inner algorithm to learn the problem in 
	three episodes.}
	\label{fig:meta_bandit_training}
\end{figure}

In our manually designed strategies, we use previous actions and rewards
to update the policy and make it better. Similarly, in order to
make meta-RL work, the agent has to receive as input the previous reward
and the action that led to that reward, but it also needs to carry on some
sort of memory of past actions and rewards. We do this manually in 
$\epsilon$-greedy by keeping track of the average reward yielded by each arm.
In the case of meta-RL, this is done by using a recurrent network which
carries a hidden state from the first episode in the trial to the last.\\


\begin{figure}
	\centering
	\includegraphics[width=0.2\linewidth]{fig/a2c_meta.eps}
	\caption{The meta-learning A2C agent. The key difference with a
	standard A2C agent is that it receives the values 
	$[a_{t-1},r_{t-1},d_{t-1}]$ in addition to the state observation. It
	is of course also trained in a different way to a standard A2C agent.}
	\label{fig:a2c_meta}
\end{figure}



The general setup presented by Wang et al. uses a recurrent neural network of
which the weights, once trained "slowly" over several trials of multiple
episodes, will encode the "fast" learning policy.
Figure \ref{fig:a2c_meta} shows the meta-learning A2C agent presented in Wang
et al. \cite{learningtorl}. There are two differences to note compared to
Figure~\ref{fig:a2c}: the first one is the recurrent connection, and the
second one is the set of values we input to the agent at each time step. In
addition to the state of the environment, we add the previous action, 
previous reward and a termination flag which is set to 1 when an episode
has reached a terminal state; 0 otherwise.\\

\begin{figure}[H]
	\centering
	\includegraphics[width=0.6\linewidth]{fig/meta_rl_timestep.eps}
	\caption{An unrolled timestep in the meta-RL training setting. At each
	timestep, the agent receives an observation of the state of the
	environment as well as the previous action taken, the associated reward,
	and the termination flag. It also receives its previous hidden state
	if and only if the previous timestep was part of the same trial (this
	could still mean that the last timestep was from a different episode
	which just terminated)}
	\label{fig:meta_rl_timestep}
\end{figure}


It is important to emphasize on the fact that the agent passes on its hidden
state between different episodes of the same trial (and if episodes last for
more than one timestep, between timesteps as well), but \textbf{not} between
trials (see Figure~\ref{fig:meta_bandit_training}).
The reason why the inter-episode connection is needed is because
the agent might want to deploy a different policy given the results of 
previous episodes as it has to optimise its expected reward across multiple
episodes.

\section{Agent architecture}
\begin{figure}
	\centering
	\includegraphics[width=0.7\linewidth]{fig/agent_architecture.eps}
	\caption{Architecture of the meta-learning A2C agent}
	\label{fig:agent_architecture}
\end{figure}
We use the same A2C agent that Wang et al. \cite{learningtorl} have used
for their bandit experiments. It is described in Figure~\ref{fig:agent_architecture}.
The input vector is a concatenation of the state observation, the previous
action taken, the reward obtained at the previous timestep and the termination
flag :
$$ i = [s_t, a_{t-1}, r_{t-1}, d_{t-1}] $$
The action taken is converted to a one-hot encoding (converting the action
index into a vector of length $|\mathcal{A}|$ containing zeros except a single
1 at the index of the action). This vector is then fed to a LSTM sized at 48
hidden units, followed by two branches :
\begin{enumerate}
	\item a fully connected softmax layer with $|\mathcal{A}|$ units (the policy
		output). Actions are selected by sampling according to the
		distribution defined by the policy, rather than by taking the
		action with the highest probability.
	\item a fully connected linear layer with 1 unit (the value output).
\end{enumerate}
The loss of the A2C agent is the following : 
$$ \mathcal{L} = \beta_v \mathcal{L}_v + \beta_p \mathcal{L}_p - \beta_e 
 \mathcal{L}_e $$
where $\mathcal{L}_v$ is the value loss : 
$$ \mathcal{L}_v = (R_{t-1} - V(s_t))^2$$
$\mathcal{L}_p$ is the policy loss : 
$$ \mathcal{L}_p = \pi(a_t \mid s_t) (R_t - V(s_t))$$
$\mathcal{L}_e$ is the entropy regularisation : 
$$ \mathcal{L}_e = \sum\limits_{a \in A}\pi(a \mid s_t)\log(\pi(a \mid s_t))$$
and $\beta_v = 0.5$, $\beta_p = 1$, $\beta_e = 0.05$. An update is performed
once after every trial using Adam \cite{adam} with a learning rate of 0.001.\\

Unless stated, the discount factor used in all experiments is $\gamma=0.9$. 


\section{Meta-learning dependent bandits}
Let us come back to the problem stated previously as an example. This will
prove useful as it is the same setting used by Wang et al. \cite{learningtorl}.
We can then compare our results with theirs to introduce the current state
of the art, test our implementation and verify that their results can 
be reproduced. This setting has the advantage of being relatively simple
so it will help us understand meta-learning before jumping to a harder problem.\\

The setting is the following: a dependent
bandit with 2 arms of which the reward distribution is a Bernoulli distribution
with the following parameters : 
$$ \begin{cases} P(r \mid \text{arm}_1) = p_b \\ 
P(r \mid \text{arm}_2) = 1 - p_b  \end{cases} $$

We define a training distribution of dependent problems with
$p_b \in [0.1, 0.2, 0.8, 0.9]$, creating the following set of dependent
bandit problems:
\begin{table}[H]
	\centering
	\begin{tabular}{c|c}
		Arm \#1 & Arm \#2 \\ \hline
		0.1 & 0.9 \\ \hline
		0.2 & 0.8 \\ \hline
		0.8 & 0.2 \\ \hline
		0.9 & 0.1
	\end{tabular}
\end{table}

The agent will play trials of 100 episodes - meaning that we choose one bandit
problem and let the agent pull either one of its arms 100 times, then reset
the hidden state and start again. Since the problem is stateless, the agent
only receives the last action taken, the previous reward and the termination
flag as input. The discount factor $\gamma$ is set to 0.8 to replicate 
Wang et al.'s experiment \cite{learningtorl}.\\

Wang et al. \cite{learningtorl} achieved creating a meta-learning
agent that could learn a dependent problem as the one defined above in only 
a few episodes (one episode being one pull of the bandit). For some experiments,
their agent outperforms the state of the art; and a look over the behaviour
of their agent shows only a minimal number of episodes spent exploring before
exploiting only.\\

Figure~\ref{fig:bandit_reward} shows that after some time, the agent figures out
a way to yield an excellent reward over all trials: for trials using bandits
with $p_b \in [0.2, 0.8]$, even if the agent pulled the best arm for
each episode (which rules out exploration completely), the total reward
expectation is 80, and for trials using bandits with $p_b \in [0.1, 0.9]$, 
the total reward expectation is 90. Scoring an average trial reward between
80 and 85 looks very close to optimal, if not fully optimal.\\

\begin{figure}[H]
	\centering
	\includegraphics[width=0.75\linewidth]{fig/bandit_reward.pdf}
	\caption{Evolution of the smoothed reward of trials during training on
	dependent bandit problems}
	\label{fig:bandit_reward}
\end{figure}

If we look at the evolution of decisions taken by the agent during training
on Figure~\ref{fig:bandit_optimality}, and especially on
Figure~\ref{fig:bandit_optimality}b, we see that the agent evolves from playing
randomly each arm to testing both arms for just a few episodes at the start
of the trial to then always play what it judges is the best episode.\\

\begin{figure}[H]
	\centering
	\subfloat[][Pulled arm]{
		\includegraphics[width=0.49\linewidth]{fig/bandit_pulls.png}}
	\subfloat[][Optimality of the pulled arm]{
		\includegraphics[width=0.49\linewidth]{fig/bandit_optimality.png}}
	\caption{Insight into the evolution of which arms get pulled and 
	whether they are optimal during training. On the left, a black dot
	represents the fact that arm \#1 has been pulled and a white dot
	represents the fact that arm \#2 has been pulled. On the right, a white
	dot signifies that the best arm has been pulled. This figure shows
	that after a bit of training, the agent learns to quickly identify
	which arm is the optimal arm and pulls that arm only after only 
	a few episodes.}
	\label{fig:bandit_optimality}
\end{figure}

We should emphasize on the importance of these results. If we had used the 
$\epsilon$-greedy method as is, the difference in performance would be 
significant on several aspects:
\begin{itemize}
	\item with a fixed $\epsilon$, sticking to the optimal arm would have
		been impossible (as a random action is required from time to 
		time), hindering performance while it can be very clear
		which arm is the best
	\item even if we chose a small $\epsilon$ to minimise performance loss
		in the exploitation phase, we would have increased dramatically
		the probability of not exploring enough at the start of the 
		episode
	\item similar reasoning can be used to argue about the number of 
		trials over which $\epsilon$ could have been annealed from
		1 to 0.
\end{itemize}
Using meta-learning allowed us to bypass all these design considerations.
Furthermore, both Wang et al. \cite{learningtorl} and Duan et al. 
\cite{fastrlviaslowrl} show that the strategy deployed by the meta-learning
agent matches (or outperforms) the best hand-designed strategies such as 
UCB and Gittins.\\

It is worth noting however that we have only been considering known and seen
bandit problems in our discussion so far. Nevertheless, we see 
on Figure~\ref{fig:bandit_test} that the meta-learning agent can generalise to
all other values of the parameter $p_b$ with very good performance. For the 
performance analyis of Figure~\ref{fig:bandit_test}, we let the agent play 
100 trials of 20 equally spaced values of $p_b$ in the range $[0.5, 0.975]$.
For each trial, the optimal arm was randomly shuffled to be either the first
or second arm. We compare the average reward obtained to the optimal expected
reward (the theoretical expected reward obtained by exclusively pulling the
optimal arm, which is not practically possible since the agent doesn't know
which arm is the optimal arm and has to explore for several episodes).\\

The agent performs very well on unseen values of $p_b$, even harder ones which
are closer to 0.5, even though it has only ever seen $p_b = 0.8$ and $p_b = 0.9$.

\begin{figure}[H]
	\centering
	\includegraphics[width=0.8\linewidth]{fig/bandit_test.pdf}
	\caption{Average performance of the meta-learning agent on the whole
	range of the parameter $p_b$. The agent has only seen
	$p_b \in [0.8, 0.9]$. The performance is compared to the theoretical
	expected reward if the agent selected the optimal arm for each episode.}
	\label{fig:bandit_test}
\end{figure}


\subsubsection{Summary}
Now that we have laid out the stakes and goals of meta reinforcement learning,
explained the training process and the architecture of the agent,
and recreated state of the art experiments to verify both our implementation
and the results obtained in the literature; we will extend the application
of meta learning to a new class of problem. We will derivate a distribution
of tasks from the CartPole problem and analyse the performance and dynamics
of meta-learning applied to this case.


